So far we have used linear regressions, where we modeled the outcome as follows:
\[ y \sim Normal(\mu,\sigma) \\ \mu = \alpha + \beta X \]
This regression generally works well, even if \(\small y\) is not normally distributed. (The residuals should be, though, and a posterior predictive plot is always useful!)
However, for some outcome types a linear regression simply does not work. This is particularly clear when the outcome is bound to be equal to or larger than zero (like counts that “clump” at zero) or when the outcome is binary.
As an example, the next figure shows end-of-year math grades in a Portuguese school.1
df = read.table("data/student-mat.csv", sep = ";", header = TRUE)
df = df[df$Medu > 0, ]  # drop pupils with maternal education coded as 0
set_par()
# pad the grade table with zeros so the x-axis runs all the way to 20
x = barplot(c(table(df$G3), 0, 0, 0), xaxt = "n", border = "grey")
axis(1, at = x[seq(0, 20, 2) + 1], labels = seq(0, 20, 2))
In these situations we can use this kind of model:
\[ y \sim dist(\theta_1,\theta_2) \]
Here \(\small dist(\theta_1,\theta_2)\) is a distribution with two parameters2 that is consistent with the observed data.
Choosing a different distribution means choosing a different likelihood function. That is, we exchange dnorm with an alternative distribution that is appropriate for the data.
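To make the swap concrete, here is a minimal sketch (illustrative values, not from the school data) of the density and mass functions we can plug in as likelihoods:

```r
# One observation, three candidate likelihoods:
dnorm(1.2, mean = 0, sd = 1)      # Gaussian: continuous outcomes
dbinom(3, size = 10, prob = 0.5)  # Binomial: successes out of n trials
dpois(3, lambda = 2)              # Poisson: unbounded event counts
```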
Here are a few examples:
set_par()
x = seq(0,5,1)
plot(x-.2,dbinom(x, 5, .2),"h", xlim = c(0,20), lwd = 2,
ylab = "probability mass", xlab = "number of successes")
x = seq(0,10,1)
lines(x,dbinom(x, 10, .5),"h", col = "blue", lwd = 2)
x = seq(0,20,1)
lines(x+.2,dbinom(x, 20, .75),"h", col = "red", lwd = 2)
legend("topright",
lwd = 2, col = c("black","blue","red"),
legend = c(
expression(theta[1]~" = n = 5, "~theta[2]~" = p = .2"),
expression(theta[1]~" = n = 10, "~theta[2]~" = p = .5"),
expression(theta[1]~" = n = 20, "~theta[2]~" = p = .75")
),
bty = "n")
set_par()
x = seq(0,5,1)
plot(x-.2,dpois(x, .5),"h", xlim = c(0,20), lwd = 2,
ylab = "probability mass", xlab = "number of events")
x = seq(0,10,1)
lines(x,dpois(x, 5),"h", col = "blue", lwd = 2)
x = seq(0,20,1)
lines(x+.2,dpois(x, 10),"h", col = "red", lwd = 2)
legend("topright",
lwd = 2, col = c("black","blue","red"),
legend = c(
expression(theta[1]~" = "~lambda~" = .5"),
expression(theta[1]~" = "~lambda~" = 5"),
expression(theta[1]~" = "~lambda~" = 10")
),
bty = "n")
As can be seen from the previous plots, the expected value (the mean parameter) of the Binomial and Poisson distributions is constrained: the Binomial probability \(\small p\) must lie between 0 and 1, and the Poisson rate \(\small \lambda\) must be positive.
But if we just modeled, e.g., \[ p = \alpha + \beta X \] we could get values between minus and plus infinity. Therefore the generalized linear model also uses link functions that map the result of the linear predictor \(\small \alpha + \beta X\) to the desired range:
\[ p = link(\alpha + \beta X) \]
Different distributions use different link functions to implement different constraints: the Binomial probability \(\small p\) is mapped to \((0,1)\) with the inverse logit, and the Poisson rate \(\small \lambda\) to \((0,\infty)\) with the exponential function:
\[ \small inv.logit(x) = \frac{1}{1+\exp(-x)} = \frac{\exp(x)}{1+\exp(x)} \]
set_par()
curve(boot::inv.logit(x),-5.1,5.1,n = 100,
xlab = expression(alpha~" + "~beta~"X"), ylab = "p")
set_par()
curve(exp(x),-3,4,n = 100,
xlab = expression(alpha~" + "~beta~"X"), ylab = expression(lambda))
One important effect of link functions is that we cannot interpret regression weights in the same way as for simple linear regression. In a nutshell, if link functions exponentiate the linear predictor, regression weights represent multiplicative rather than additive effects.
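A quick numeric sketch of this multiplicative interpretation (the intercept and slope here are made up): with a logit link, a one-unit increase in X multiplies the odds by \(\small \exp(\beta)\) rather than adding to the probability.

```r
inv_logit = function(x) 1 / (1 + exp(-x))
odds = function(p) p / (1 - p)
a = -2; b = 0.7  # hypothetical coefficients on the log-odds scale
# odds ratio for a one-unit increase in X
or = odds(inv_logit(a + b)) / odds(inv_logit(a))
all.equal(or, exp(b))  # TRUE: the odds ratio equals exp(beta)
```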
In R, family is the term for an object that
tells a regression model both what outcome distribution to use and what
link function to use.
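For instance, in base R's glm() the same idea looks like this (a sketch with made-up data; the variable names mirror our example):

```r
# The family argument bundles the outcome distribution and the link function.
d = data.frame(fail     = c(0, 1, 0, 1, 0, 1),
               failures = c(0, 0, 1, 1, 2, 2))
fit = glm(fail ~ failures, family = binomial(link = "logit"), data = d)
coef(fit)  # estimates live on the log-odds scale
```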
When do we use the Binomial and when the Poisson likelihood? In short: the Binomial when each observation counts successes out of a known number of trials (so the count has an upper bound), and the Poisson when we count events without a fixed upper bound.
For a hands-on example of logistic regression we will try to predict failing the math class with the Portuguese school data shown above. In particular, we will use maternal education (Medu) and the number of previously failed classes (failures) as predictors.
It is always a good idea to plot the data first:
data.list = list(
Medu = df$Medu,
failures = df$failures,
fail = df$G3 == 0
)
set_par(mfrow = c(1,2))
table(data.list$Medu) %>% barplot(border = "grey", ylab = "count", xlab = "maternal edu")
table(data.list$failures) %>% barplot(border = "grey", xlab = "number fails")
We de-mean Medu, so that the intercept measures the odds of failing at average maternal education.
data.list$cMedu = data.list$Medu-2.5
We begin the analysis with an intercept-only model:
model = alist(
fail ~ dbinom(1,p),
logit(p) <- a,
a ~ dnorm(0,10)
)
And we estimate the model with ulam (i.e. Stan):
u.fit_I = ulam(
model,
data = data.list,
iter = 2000, # 2000 iterations, (1000 warmup)
chains = 4, # four chains, to check convergence
cores = 4, # use 4 cores in parallel
cmdstan = TRUE) # use cmdstanr not rstan
We extract the prior and plot the prior prediction for the failure rate:
prior = extract.prior(u.fit_I)
set_par()
hist(inv_logit(prior$a), main="")
To see what is going on here, let's overlay the logistic function on the prior:
set_par(mfrow = c(1,2), mar=c(3,3,3,3), cex = 1)
curve(dnorm(x,0,10),-30,30, ylab = "prior density", xlab = "a")
axis(4,at = seq(0,0.04, length.out = 5),
labels = seq(0,1,length.out = 5), col = "red")
curve(inv_logit(x)/25, add = T, col = "red")
abline(h = .1/25, lty = 3, col = "grey")
title("a ~ dnorm(0,10)")
curve(dnorm(x,0,2),-6,6, ylab = "prior density", xlab = "a")
axis(4,at = seq(0,0.2, length.out = 5),
labels = seq(0,1,length.out = 5), col = "red")
curve(inv_logit(x)/5, add = T, col = "red")
abline(h = .1/5, lty = 3, col = "grey")
title("a ~ dnorm(0,2)")
With a wide prior, most of the probability mass is smaller than -5, leading to p very close to 0, or larger than 5, leading to p close to 1.
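We can check this numerically by pushing draws from each prior through the inverse logit (a quick sketch):

```r
set.seed(1)
inv_logit = function(x) 1 / (1 + exp(-x))
p_wide   = inv_logit(rnorm(1e5, 0, 10))  # a ~ dnorm(0, 10)
p_narrow = inv_logit(rnorm(1e5, 0, 2))   # a ~ dnorm(0, 2)
mean(p_wide   < .05 | p_wide   > .95)    # most mass piles up near 0 and 1
mean(p_narrow < .05 | p_narrow > .95)    # far less extreme
```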
How about the regression coefficients \(\small \beta\)? Regression coefficients are log odds ratios.
For instance, let's assume that 2 of 100 pupils without a prior class failure fail the math class, and 4 of 100 pupils with a prior failure do. Then the odds ratio is:
\[ \begin{align*} OR = & \frac{\textrm{failure odds no prior fail}}{\textrm{failure odds prior fail}} \\ = & \frac{\frac{2}{98}}{\frac{4}{96}} = \frac{.0204}{.0417} = .49 \end{align*} \]
Note that this specific odds ratio comes close to the risk ratio of \(\small .02 / .04 = .5\).
This is, however, not generally the case. If the probabilities are far from zero, risk ratio and odds ratio do not align! We can see this if we multiply the failure probabilities by 15 in both groups:
\[ \begin{align*} OR = \frac{\frac{30}{70}}{\frac{60}{40}} = \frac{.429}{1.5} = .29 \end{align*} \]
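The same arithmetic in R confirms that the risk ratio stays at .5 while the odds ratio moves:

```r
odds = function(p) p / (1 - p)
# small probabilities: OR ~ RR
round(odds(.02) / odds(.04), 2)  # 0.49
.02 / .04                        # 0.5
# 15x larger probabilities: OR and RR diverge
round(odds(.30) / odds(.60), 2)  # 0.29
.30 / .60                        # 0.5
```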
Let's go back to specifying our prior for \(\small \beta\):
[Figures: prior predictions under a dnorm(0, .7) prior and a dnorm(0, 2) prior.]
Let's specify such a model and look at the prior predictions for the difference between two predictor levels:
model = alist(
fail ~ dbinom(1,p),
logit(p) <- a + b1*failures,
a ~ dnorm(0,2),
b1 ~ dnorm(0,2)
)
u.fit_F = ulam(
model,
data = data.list,
iter = 2000, # 2000 iterations, (1000 warmup)
chains = 4, # four chains, to check convergence
cores = 4, # use 4 cores in parallel
cmdstan = TRUE) # use cmdstanr not rstan
prior = extract.prior(u.fit_F)
Now we can use the link_ulam function and new data to generate predictions from the prior. First we look at the effect of one additional previous class failure.
p = link_ulam(
u.fit_F, post = prior,
data = list(failures = c(0,1)))
set_par(mfrow = c(1,2), mar=c(3,3,3,3), cex = 1)
hist(p[,2]-p[,1],
main = "risk difference", xlab = "P(fail|high) - P(fail|low)")
hist(p[,2]/p[,1], breaks = 30,
main = "risk ratio", xlab = "P(fail|high) / P(fail|low)")
These differences and ratios are pretty large, so it is safe to make the prior a bit narrower and set the standard deviation for the regression weights to 1.
Here is our model so far, which includes previous class failures and maternal education as predictors:
model = alist(
fail ~ dbinom(1,p),
logit(p) <- a + b1*failures + b2*cMedu,
a ~ dnorm(0,2),
b1 ~ dnorm(0,1),
b2 ~ dnorm(0,1)
)
u.fit_FE = ulam(
model,
data = data.list,
log_lik = TRUE, # for model comparison
iter = 2000, # 2000 iterations, (1000 warmup)
chains = 4, # four chains, to check convergence
cores = 4, # use 4 cores in parallel
cmdstan = TRUE) # use cmdstanr not rstan
Some quick convergence diagnostics:
precis(u.fit_FE, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## a -2.5837038 0.2204297 -2.9542099 -2.24176580 2273.241 0.9992693
## b1 0.6900335 0.1770531 0.4095241 0.98004828 2358.207 1.0000829
## b2 -0.2939670 0.1704245 -0.5669423 -0.02034664 3005.834 1.0000001
To see if our model describes the data well, we plot model predicted and observed data together:
aggr.fun = function(x) return(c(mean = mean(x), N = length(x)))
obs =
aggregate(
data.list$fail,
by = with(data.list,
data.frame(cMedu = cMedu, Medu = Medu, failures = failures)),
aggr.fun
)
obs = cbind(obs,obs$x)
sim.data =
unique(as.data.frame(data.list)[,c(2,4)])
p = link_ulam(
u.fit_FE,
data = sim.data)
pp.stats = cbind(
sim.data,
m = colMeans(p),
t(apply(p,2,PI))
)
set_par(mfrow = c(2,2), mar=c(3,3,1.5,0.5), cex = .75)
tmp =
lapply(unique(obs$cMedu), function(x) {
plot(obs[obs$cMedu == x,"failures"],
obs[obs$cMedu == x,"mean"],
cex = sqrt(obs[obs$cMedu == x,"N"]),
xlim = c(-.5,3.5), ylim = c(0,.5),
ylab = "proportion fail",
xlab = "# past failures",
xaxt = "n", pch = 16, col = "grey")
title(paste("maternal edu",x+2.5),line = 0.5, cex.main = 1)
axis(1,at = 0:3)
points(pp.stats[pp.stats$cMedu == x,"failures"],
pp.stats[pp.stats$cMedu == x,"m"], pch = 16, col = "blue")
arrows(pp.stats[pp.stats$cMedu == x,"failures"],
y0 = pp.stats[pp.stats$cMedu == x,"5%"],
y1 = pp.stats[pp.stats$cMedu == x,"94%"],
length = 0, col = "blue")
})
This does not look great. We can try to improve the model by giving each level of maternal education its own intercept and its own slope:
model = alist(
fail ~ dbinom(1,p),
logit(p) <- a[Medu] + b[Medu]*failures,
a[Medu] ~ dnorm(0,2),
b[Medu] ~ dnorm(0,1)
)
u.fit_FE.2 = ulam(
model,
data = data.list,
log_lik = TRUE, # for model comparison
iter = 2000, # 2000 iterations, (1000 warmup)
chains = 4, # four chains, to check convergence
cores = 4, # use 4 cores in parallel
cmdstan = TRUE) # use cmdstanr not rstan
Did the chains converge?
precis(u.fit_FE.2)
## 8 vector or matrix parameters hidden. Use depth=2 to show them.
## [1] mean sd 5.5% 94.5% n_eff Rhat4
## <0 rows> (or 0-length row.names)
This looks OK (use depth = 2 to inspect the vector parameters), so we again do a posterior predictive check:
sim.data =
unique(as.data.frame(data.list)[,c(1,2)])
p = link_ulam(
u.fit_FE.2,
data = sim.data)
pp.stats = cbind(
sim.data,
m = colMeans(p),
t(apply(p,2,PI))
)
set_par(mfrow = c(2,2), mar=c(3,3,1.5,0.5), cex = .75)
tmp =
lapply(unique(obs$Medu), function(x) {
plot(obs[obs$Medu == x,"failures"],
obs[obs$Medu == x,"mean"],
cex = sqrt(obs[obs$Medu == x,"N"]),
xlim = c(-.5,3.5), ylim = c(0,.5),
ylab = "proportion fail",
xlab = "# past failures",
xaxt = "n", pch = 16, col = "grey")
title(paste("maternal edu",x),line = 0.5, cex.main = 1)
axis(1,at = 0:3)
points(pp.stats[pp.stats$Medu == x,"failures"],
pp.stats[pp.stats$Medu == x,"m"], pch = 16, col = "blue")
arrows(pp.stats[pp.stats$Medu == x,"failures"],
y0 = pp.stats[pp.stats$Medu == x,"5%"],
y1 = pp.stats[pp.stats$Medu == x,"94%"],
length = 0, col = "blue")
})
This does not look that much better. Let’s check this with PSIS-LOO:
compare(u.fit_FE.2,u.fit_FE, func = "PSIS")
## PSIS SE dPSIS dSE pPSIS weight
## u.fit_FE 232.9903 25.47049 0.000000 NA 3.039043 0.97519255
## u.fit_FE.2 240.3333 25.64213 7.342982 3.236801 7.300440 0.02480745
In this case the added complexity has not helped much. So let's look at the parameters of the first model:
plot(precis(u.fit_FE, depth = 2))
summary = precis(u.fit_FE, depth = 2) %>% round(2)
summary
## mean sd 5.5% 94.5% n_eff Rhat4
## a -2.58 0.22 -2.95 -2.24 2273.24 1
## b1 0.69 0.18 0.41 0.98 2358.21 1
## b2 -0.29 0.17 -0.57 -0.02 3005.83 1
What does this all mean?
Because the parameters are log odds ratios, we exponentiate them to get ORs. Because the probability of our outcome is (relatively) close to zero, we can interpret the ORs as approximate risk ratios. Hence each additional past failure roughly doubles the risk of failing, while each additional level of maternal education reduces it by about 25%.
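Exponentiating the posterior means reported above makes this concrete:

```r
exp(c(b1 = 0.69, b2 = -0.29))
# b1: ~2.0  -> each additional past failure about doubles the odds of failing
# b2: ~0.75 -> each extra level of maternal education lowers them by ~25%
```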
The ORs and approximate risk ratios that the link function forces on us make the results hard to interpret. But if we create posterior predictions, we can simply calculate contrasts on the probability scale.
Let's look at the data we used to generate posterior predictions:
sim.data =
unique(as.data.frame(u.fit_FE@data)[, c("cMedu","failures","Medu")])
sim.data = sim.data[order(sim.data$cMedu,sim.data$failures),]
rownames(sim.data) = NULL
sim.data
## cMedu failures Medu
## 1 -1.5 0 1
## 2 -1.5 1 1
## 3 -1.5 2 1
## 4 -1.5 3 1
## 5 -0.5 0 2
## 6 -0.5 1 2
## 7 -0.5 2 2
## 8 -0.5 3 2
## 9 0.5 0 3
## 10 0.5 1 3
## 11 0.5 2 3
## 12 0.5 3 3
## 13 1.5 0 4
## 14 1.5 1 4
## 15 1.5 2 4
If we now generate the posterior predictions, we get a matrix in which each column corresponds to one row in sim.data.
pp = link_ulam(
u.fit_FE,
data = sim.data)
dim(pp)
## [1] 4000 15
A simple question would be what the effect of having one vs zero
previous failures is when maternal educational level is 1. We simply
look up in the sim.data table that we need to compare
columns one and two in the posterior predictions for this. In the same
manner we can do this for maternal educational level 2.
delta.L1 = pp[,2] - pp[, 1]
delta.L2 = pp[,6] - pp[, 5]
xlim = range(c(delta.L1,delta.L2))
breaks = seq(xlim[1]-.01,xlim[2]+.01, length.out = 25)
set_par(mfrow = c(3,1))
hist(delta.L1, main = "", xlim = xlim, breaks = breaks)
abline(v = 0, col = "red", lty = 3)
hist(delta.L2, main = "", xlim = xlim, breaks = breaks)
abline(v = 0, col = "red", lty = 3)
hist(delta.L2-delta.L1, main = "")
abline(v = 0, col = "red", lty = 3)
The third histogram shows an interaction effect on the scale of risk differences, even though our model has no interaction term! This is due to the nonlinearity of the (inverse) link function!
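A sketch with the posterior means from above shows how the link alone creates different risk differences at different education levels:

```r
inv_logit = function(x) 1 / (1 + exp(-x))
a = -2.58; b1 = 0.69; b2 = -0.29  # posterior means from u.fit_FE
# risk difference for one additional failure, at low vs high maternal edu
d_low  = inv_logit(a + b1 + b2 * -1.5) - inv_logit(a + b2 * -1.5)
d_high = inv_logit(a + b1 + b2 *  1.5) - inv_logit(a + b2 *  1.5)
c(d_low = d_low, d_high = d_high)  # unequal despite no interaction term
```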
If we, for instance, want to see the difference in fail probability between maternal education levels 1 and 4, we calculate one vector of samples with fail probabilities at level 1, one for level 4, and subtract them:
fail.level1 = rowMeans(pp[,1:4])
fail.level4 = rowMeans(pp[,13:15])
delta = fail.level4 - fail.level1
set_par()
hist(delta)
However, this analysis assumed that pupils with different numbers of previously failed classes are equally frequent, which is not the case. So we need to weight by the observed frequencies:
M1_p.failures = prop.table(table(df$failures[df$Medu == 1]))
M4_p.failures = prop.table(table(df$failures[df$Medu == 4]))
fail.level1 = pp[,1:4] %*% M1_p.failures
fail.level4 = pp[,13:15] %*% M4_p.failures
delta = fail.level4 - fail.level1
set_par()
hist(delta)
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. Data can be downloaded here↩︎
There are also distributions with 1 or 3 or more parameters↩︎